A supervised protein complex prediction method with network representation learning and gene ontology knowledge

Background Protein complexes are essential for understanding cell organization and function. In recent years, predicting complexes from protein–protein interaction (PPI) networks with computational methods has become a research hotspot, and many prediction methods have been proposed. However, how to use the information of known protein complexes remains a fundamental open problem in protein complex prediction. Results To address this problem, we propose a supervised learning method based on network representation learning and gene ontology knowledge, which can fully use the information of known protein complexes to predict new protein complexes. The method first constructs a weighted PPI network based on gene ontology knowledge and topology information, which reduces the noise in the network. On this basis, the topological information of known protein complexes is extracted as features, and the supervised learning model SVCC is trained on these features. The SVCC model is then used to predict candidate protein complexes from the protein interaction network. Next, we use a network representation learning method to obtain the vector representation of each protein complex and train a random forest model. Finally, we use the random forest model to classify the candidate protein complexes and obtain the final predicted protein complexes. We evaluate the performance of the proposed method on two public PPI data sets. Conclusions Experimental results show that our method improves the performance of protein complex recognition compared with existing methods. We also analyze the biological significance of the protein complexes predicted by our method and by other methods. The results show that the protein complexes predicted by our method have high biological significance.


Introduction
seed cores from the candidate cores. If the degree of connection between the protein and the seed core exceeds the threshold, add the protein to the core to obtain a protein complex. The methods mentioned above are all unsupervised learning methods. They predict protein complexes based on the topological information of the protein interaction network and cannot use the data of known protein complexes.
Recently, supervised learning methods, which can use the information of known protein complexes to predict new ones, have been successfully applied to protein complex prediction. Yu et al. proposed the SLPC method [16]. This method first extracts features of protein complexes from the weighted and unweighted networks and trains a logistic regression model. It then finds the largest sub-graphs in the PPI network as cores and uses the model to add auxiliary nodes to each core to obtain protein complexes. Zhu et al. [17] proposed a semi-supervised network embedding model. It first selects the key neighborhood nodes as vertex attributes and obtains the first-order approximation of each vertex. It then designs a three-layer GCN to calculate the second-order approximation of the vertex and optimizes the first-order approximation. Finally, the model is obtained from the second-order approximation and used to identify protein complexes. Faridoon et al. [18] combined the support vector machine with the ECOC algorithm, using the physical properties of amino acids and various topological information as features to predict protein complexes from the PPI network. These methods usually extract features from known protein complexes, train a classification model on those features, and then use the trained model to predict protein complexes from the protein interaction network. However, the large amount of noisy data in the PPI network, together with the fact that many features exist only in specific networks and are not universal, leads to uncertainty in the classification model. Therefore, obtaining effective features from known protein complexes is the key to supervised learning algorithms. In addition, the abovementioned unsupervised and supervised learning methods have only been explored on the yeast PPI network.
In this paper, we propose a protein complex prediction method based on supervised learning, which can fully use the information of known protein complexes. To reduce the noise in the network and mine the biological information contained in the protein network, we introduce gene ontology (GO) knowledge [19] to construct a weighted PPI network. To further improve prediction performance, we use network representation learning to obtain the vector representation of each protein complex. We first use the GO knowledge to weight the PPI network and filter out the low-confidence relationships in it. Second, we extract the rich topological information of protein complexes as features and construct the training set based on the weighted and unweighted PPI networks. We train the supervised learning model SVCC on this training set and use it to predict candidate protein complexes from the PPI network. Then, we apply the network representation learning method to obtain the vector representation of each node in the PPI network and derive the vector representation of each protein complex from the representations of its protein nodes. Finally, we train the random forest model RF [20] on the vector representations of the training set complexes and use it to classify the candidate protein complexes. The protein complexes marked as positive examples are the final predicted protein complexes. To verify the performance of the proposed method, we conduct experiments on the yeast PPI network DIP [21] and the human PPI network HPRD [22]. Experimental results show that our method is superior to existing methods in predicting protein complexes in the PPI network. In addition, considering the particularity of the relationships between human proteins, we also analyze the biological significance of the protein complexes predicted by our method and by other methods. The results show that our method can predict protein complexes with biological significance.

Methods
We detail our protein complex prediction method in this section. Our method mainly includes four parts: (1) weighted PPI network using GO knowledge; (2) generating supervised features for protein complex prediction; (3) the first stage of protein complex candidate recognition; (4) the second stage of final protein complex classification. Figure 1 shows the overall workflow of our method.

Weighted PPI network using GO knowledge
A protein interaction network is the basis for using computational methods to predict protein complexes. However, due to the limitations of technology and the flow characteristics of the protein interaction network, protein interaction data sets generated by high-throughput experiments often contain a lot of noisy data [23][24][25]. There are two main types of noisy relationships in PPI networks: false negatives and false positives. A false-negative relationship is an interaction between two proteins that has not been discovered or documented in a database. A false-positive relationship is an interaction that does not actually exist between two proteins but has been recorded in the protein interaction database because of experimental error. To address this problem, researchers have found that applying topological characteristics of protein-protein interaction networks, or protein biological information such as gene expression data and gene ontology (GO) knowledge, can improve the accuracy and reliability of protein-protein interaction data.
There are different ways to construct a weighted PPI network. For instance, we can calculate the protein similarity according to the topological relationship between proteins to obtain a weighted PPI network. We can also use biological information, such as GO or gene expression, to calculate the credibility of the relationship between proteins to get a weighted PPI network. In this paper, we combine the biological information of proteins with the topological information of the protein-protein interaction network to measure the degree of trust between proteins, and then construct a weighted protein-protein relationship network. To calculate the topological similarity between proteins, we introduce the similarity metric HOCN proposed by Wang et al. [14], which is based on Jaccard's similarity coefficient. The main idea is to estimate the topological similarity between nodes based on the high-order common neighborhood of two adjacent nodes. Jaccard's similarity coefficient is a similarity measure proposed by Jaccard et al. The Jaccard's similarity coefficient between two neighbor proteins v and u is defined by Eq. (1):

$$JCS(v, u) = \frac{|N(v) \cap N(u)|}{|N(v) \cup N(u)|} \tag{1}$$

where N (v) and N (u) represent the sets of adjacent points of v and u , respectively. N (v) ∪ N (u) represents the union of the adjacent points of v and u . CN (v, u) represents the set of common adjacent points of v and u , namely N (v) ∩ N (u) . |N (v) ∩ N (u)| and |N (v) ∪ N (u)| represent the number of common adjacent points and the size of the union of adjacent points of v and u , respectively.
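As a minimal sketch of the Jaccard coefficient described above, the following Python snippet computes |N(v) ∩ N(u)| / |N(v) ∪ N(u)| on a toy network. The adjacency dictionary, the protein names, and the function name `jaccard_similarity` are illustrative assumptions, not part of the original method's code.

```python
def jaccard_similarity(adj, v, u):
    """Jaccard coefficient |N(v) & N(u)| / |N(v) | N(u)| of two proteins."""
    neighbors_v, neighbors_u = adj[v], adj[u]
    union = neighbors_v | neighbors_u
    if not union:
        return 0.0  # two isolated nodes share nothing
    return len(neighbors_v & neighbors_u) / len(union)

# Toy PPI network (hypothetical protein names), stored as neighbor sets.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

# A and C share the neighbors B and D; the union of their neighbors is {A, B, C, D}.
score_ac = jaccard_similarity(adj, "A", "C")
score_ab = jaccard_similarity(adj, "A", "B")
```

Note that this covers only the plain Jaccard term; the HOCN metric additionally uses the connection strength of the common neighborhood, which is not reproduced here.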
HOCN is proposed based on the Jaccard similarity coefficient, and its definition is shown in Eq. (2). The topological similarity between protein v and protein u is determined by not only the Jaccard similarity coefficient but also the degree of connection between their common neighborhood and edge ( v, u ). The degree of connection between the common neighborhood and the edge ( v, u ) is defined as CNS , as shown in Eq. (4).
Gene Ontology (GO) is one of the most comprehensive ontology databases in bioinformatics. GO provides a series of GO terms to describe the characteristics of gene products, covering three aspects: biological process (BP), cellular component (CC), and molecular function (MF). The more GO terms two proteins have in common, and the more specific the information those GO terms describe, the higher the biological semantic similarity between the two proteins. In this paper, we calculate the biological similarity JCS * sim(v, u) between proteins v and u according to the number of shared GO terms and the number of proteins annotated by each GO term, as follows.
Proteins v and u are each annotated by multiple different GO terms. C(v, u) represents the set of GO terms that annotate both protein v and protein u . S i (v, u) (1 ≤ i ≤ n) represents the set of proteins annotated by the i-th GO term shared by proteins v and u . S max represents the maximum number of proteins annotated by a single GO term among all GO terms.
To calculate the similarity of two proteins v and u , we combine the topological similarity and biological similarity between proteins, and its definition is shown in Eq. (7).

Generating supervised features for protein complex prediction
Extracting key features that characterize protein complexes is crucial in our research, and a lot of work has been done in this area. We designed 16 features, extracted from the weighted and unweighted networks, to describe protein complexes. A detailed description of the features is given below.
1. Density: Density is an essential feature in the network and has been widely used in protein complex identification. For an unweighted graph, if G = (V , E) has |E| edges, the density is defined as |E| divided by the theoretical maximum possible number of edges in the graph |E| max , |E| max = |V | × (|V | − 1)/2 . For a weighted graph, set G = (V , E, W ) , the weight of the edge (v, u) is w(v, u) , and its density is defined as shown in formula (8).
2. Degree statistics: For unweighted graphs, the node degree is defined as the number of neighbor nodes of the node. For weighted graphs, the node degree is defined as the sum of the weights between the node and its connected nodes. We choose the maximum, average and median of the node degree in the weighted graph and the unweighted graph as features of the sub-graph.
3. Edge weight statistics: Edge weight is also an essential feature of weighted networks. It is similar to node degree, and both describe the characteristics of edges in the network. We choose the average and variance of all edge weights in the sub-graph as features of the sub-graph.
4. Degree-related attributes: Degree-related attributes measure the connectivity between a node in the sub-graph and its neighbor nodes. For each node, the attribute is defined as the average number of connections of its nearest neighbor nodes, that is, their average degree. We choose the average and variance of the degree-related attributes of the nodes in the sub-graph as features of the sub-graph.
5. Modularity: Modularity indicates the tightness of node connections in the sub-graph. For a weighted graph G = (V , E, W ) and any sub-graph SG ⊆ G , let the sum of the weights of the inner edges of SG be d in , and let the sum of the weights of the edges connecting SG to external nodes be d out . Then the modularity M SG of SG is defined as shown in formula (9).
6. Clustering coefficient: For unweighted graphs, the clustering coefficient of node v is the ratio of the number of triangles passing through v to the maximum number of triangles that could be formed by v and its neighbors. Its definition is shown in formula (10).
T (v) represents the number of triangles passing through node v . N (v) represents the set of adjacent points of node v . We choose the variance of the clustering coefficient in the unweighted graph as its clustering coefficient feature. The definition of the clustering coefficient in the weighted graph is shown in formula (11).
where k v represents the number of neighbor nodes of node v, and w v, j represents the weight of the edge between nodes v and j . w(v) represents the sum of the weights of the edges between node v and all adjacent nodes. We choose the average and maximum weighted graph clustering coefficient values as their clustering coefficient characteristics.
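To make a few of the unweighted features listed above concrete, here is a minimal Python sketch of density, degree statistics, and the node clustering coefficient. Formula (10) is not reproduced in this text, so the sketch assumes the standard form 2·T(v) / (|N(v)|·(|N(v)| − 1)), which matches the verbal definition. The function names and the toy sub-graph are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, median

def density(nodes, edges):
    """Unweighted density: |E| divided by |V| * (|V| - 1) / 2."""
    max_edges = len(nodes) * (len(nodes) - 1) / 2
    return len(edges) / max_edges if max_edges else 0.0

def neighbor_sets(nodes, edges):
    """Build N(v) for every node from an undirected edge list."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def clustering_coefficient(adj, v):
    """Assumed form of formula (10): 2 * T(v) / (|N(v)| * (|N(v)| - 1))."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # fewer than two neighbors cannot close a triangle
    # T(v): pairs of neighbors of v that are themselves connected.
    triangles = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * triangles / (k * (k - 1))

# Toy sub-graph (hypothetical proteins): a triangle A-B-C with a pendant node D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = neighbor_sets(nodes, edges)

degree_values = [len(adj[v]) for v in nodes]
degree_features = (max(degree_values), mean(degree_values), median(degree_values))
sub_graph_density = density(nodes, edges)
```

The weighted counterparts (formulas (8), (9) and (11)) would replace edge counts with sums of edge weights and are not sketched here.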

The first stage of protein complex candidate recognition
This section proposes a supervised learning method SVCC for identifying protein complexes from protein interaction networks. The supervised learning method SVCC mainly includes four steps: (1) the first stage recognition model training; (2) sub-graphs selection; (3) sub-graphs expansion; (4) sub-graphs filtration. The overall flow chart of SVCC is shown in Fig. 2.

The first stage recognition model training
In this study, we apply the support vector machine algorithm SVC [26] to predict protein complexes from the protein interaction network. SVC is a support vector machine algorithm mainly used to solve classification problems. The main idea of SVC is to construct an optimal decision hyperplane in the feature space that maximizes the margin, that is, the distance between the hyperplane and the closest samples of the two classes on either side of it. Thus, it provides good generalization ability for the classification problem. Compared with other classification methods, SVC requires relatively few training samples. Because SVC introduces a kernel function, it can easily cope with high-dimensional or nonlinear data samples.
We extract the 16 topological features of the protein complexes in the network from the positive and negative examples to form the feature vectors of the protein complexes, and combine them to obtain the training set. After constructing the training set, we use it as input data to train the SVC model. We conducted parameter tuning tests on the main hyperparameters C and Degree of the SVC model. Based on the preliminary experimental results, we set C = 3 and Degree = 4 in our experiments. The parameters of the SVC model used in this article are shown in Table 1.

Sub-graphs selection
We use the trained SVC model to predict protein complexes from the protein interaction network. We first use the Clique [27] algorithm to search for the largest fully connected sub-graphs (cliques) in the protein interaction network. The Clique algorithm finds these cliques with a depth-first search. We choose the sub-graphs with at least 3 proteins as the initial sub-graphs. Since the initial sub-graphs may overlap, we need to filter them. We use the trained SVC model to determine the probability that each sub-graph is an actual complex and arrange the sub-graphs in descending order of probability. For any sub-graph C i , we calculate the number of overlapping proteins between it and each sub-graph C k whose probability is lower. If the number of overlapping proteins exceeds the given threshold α , the sub-graph C k is filtered out. We repeat this process to form the final set of initial sub-graphs. Regarding the filtering threshold α , since every initial sub-graph contains at least 3 proteins, setting α to 1 would filter out too many sub-graphs, while setting α greater than 2 would mean that many sub-graphs with only 3 proteins cannot be correctly discriminated. Therefore, we set the threshold α = 2. In the sub-graph selection stage, the initial sub-graph structures obtained by the Clique method are usually relatively simple, and many of them contain only 3 proteins. For this reason, a filtering strategy based on the number of proteins shared between candidate sub-graphs is simple and efficient.
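The filtering step above can be sketched as a greedy pass over the sub-graphs in descending order of probability. This is a minimal illustration, not the authors' implementation: the function name `filter_overlapping`, the toy sub-graphs, and the probabilities are hypothetical, and the text's "exceeds" is read as a strict comparison (overlap > α), which is an assumption.

```python
def filter_overlapping(sub_graphs, probabilities, alpha=2):
    """Keep a sub-graph only if it shares at most `alpha` proteins with every
    higher-probability sub-graph that has already been kept."""
    ranked = sorted(range(len(sub_graphs)),
                    key=lambda i: probabilities[i], reverse=True)
    kept = []
    for i in ranked:
        candidate = sub_graphs[i]
        # Assumption: "exceeds the threshold" means a strict overlap > alpha.
        if all(len(candidate & better) <= alpha for better in kept):
            kept.append(candidate)
    return kept

# Hypothetical initial sub-graphs with stand-in model probabilities.
sub_graphs = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"D", "E", "F"}]
probabilities = [0.9, 0.8, 0.7]

selected = filter_overlapping(sub_graphs, probabilities, alpha=2)
```

In this toy run the second sub-graph shares three proteins with the higher-ranked first one and is dropped, while the third shares none and is kept.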

Sub-graphs expansion
For any sub-graph C i , let N (C i ) be its set of adjacent points. Each node v in N (C i ) is tentatively added to the sub-graph C i , and the trained model is used to determine the probability that {C i ∪ v} is a true complex. The node v that yields the highest increase in probability is selected and added to the sub-graph C i . We repeat this process until no node in N (C i ) increases the probability that C i is an actual complex when added to it. At this point, the sub-graph C i has been expanded into a candidate sub-graph.
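The expansion loop above is a greedy search, which the following Python sketch illustrates. The trained SVC model is replaced by a caller-supplied `score` function; the toy scoring rule used here (internal edges divided by size plus one) is a purely hypothetical stand-in chosen so the example is self-contained, and the names `expand_sub_graph` and `toy_score` are assumptions.

```python
def expand_sub_graph(core, adj, score):
    """Greedily add the neighbor that most increases `score` until none helps."""
    current = set(core)
    best = score(current)
    while True:
        # N(C_i): nodes adjacent to the sub-graph but not yet inside it.
        frontier = set().union(*(adj[v] for v in current)) - current
        if not frontier:
            break
        gains = {v: score(current | {v}) for v in frontier}
        # sorted() only makes tie-breaking deterministic.
        v_best = max(sorted(gains), key=gains.get)
        if gains[v_best] <= best:
            break  # no neighbor increases the score any further
        current.add(v_best)
        best = gains[v_best]
    return current

# Toy network (hypothetical proteins): triangle A-B-C, D linked to A and B, E linked to D.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "E": {"D"},
}

def toy_score(nodes):
    """Stand-in for the model probability: internal edges / (size + 1)."""
    internal_edges = sum(1 for v in nodes for u in adj[v] if u in nodes) / 2
    return internal_edges / (len(nodes) + 1)

candidate = expand_sub_graph({"A", "B", "C"}, adj, toy_score)
```

With a real classifier, `score` would extract the 16 topological features of the node set and return the predicted probability of the positive class.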

Sub-graphs filtration
Candidate sub-graphs may also overlap, so we need to filter them. As in the first step, we use the trained model to determine the probability that each candidate sub-graph is an actual complex and arrange the candidates in descending order of probability. After the sub-graph expansion operation, the sub-graph structures become much more complex, so it is more appropriate at this stage to use the overlap ratio to determine whether two sub-graphs overlap. For any candidate sub-graph C i , we calculate the overlap ratio overlap(C i , C k ) between it and the candidate sub-graph C k . If the overlap ratio exceeds the overlap threshold β, we merge the two candidate sub-graphs. Otherwise, the candidate sub-graph C k is filtered out. We repeat this process to obtain the candidate protein complexes. We tested values of the overlap threshold β in the interval from 0 to 1. Based on the preliminary experiments, we set β to 0.8 in this study.

The second stage of the final protein complex classification
We apply the supervised learning method SVCC to predict candidate protein complexes from protein interaction networks. However, there is a lot of noisy data in the PPI network, and many complex features exist only in specific networks and are not universal. In addition, the insufficient number of known complexes leads to uncertainty in the supervised learning model. To address these problems, we use a network representation learning method to obtain the vector representation of each protein. We then calculate the protein complex vector from the protein vector representations, use it as the feature of the protein complex, and train the random forest model RF on these feature vectors. Finally, we use the trained RF model to judge whether each candidate protein complex identified by SVCC is a real protein complex. Classifying the candidates in this way further improves the performance of protein complex recognition.

Network representation learning
The network representation learning method automatically learns the distributed representation of nodes based on the adjacency information and network topology. Compared with the traditional method of obtaining the topological characteristics of nodes in the network, the network representation learning method can represent the nodes in the protein-protein interaction network as a low-dimensional vector. It can extract the hidden information in the protein-protein interaction network, including the diversity of the connections between protein nodes. We use node2vec [28] to obtain the vector representation of nodes in the PPI network. Node2vec can automatically learn the vector representation of nodes and maximize network and node structure information retention. Node2vec uses a random walk and alias sampling strategy to obtain the structure information of nodes. In addition, a protein complex is a set of proteins. We calculate the vector of protein complex according to the vector representation of the protein.
The calculation method is shown in formula (14).
where ϕ i (i = 1, 2, . . . , m) is the vector representation of the i-th protein node in a protein complex, Z is the matrix composed of the vector representations ϕ i of the protein nodes in the protein complex, d is the dimension of ϕ i , and Z ., j is the j-th column of matrix Z.
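Formula (14) is not reproduced in this text. Since the classification section describes the complex vector as the average of the protein vector representations, the following Python sketch assumes a column-wise mean over the matrix Z. The function name `complex_vector` and the 2-dimensional toy embeddings are illustrative assumptions; in the paper the embeddings come from node2vec.

```python
def complex_vector(embeddings, complex_proteins):
    """Average the protein vectors of a complex, one mean per column of Z."""
    # Z: one row per protein in the complex, d columns (assumed layout).
    z = [embeddings[p] for p in complex_proteins]
    m = len(z)
    return [sum(column) / m for column in zip(*z)]

# Hypothetical 2-dimensional protein embeddings.
embeddings = {
    "P1": [1.0, 2.0],
    "P2": [3.0, 4.0],
    "P3": [5.0, 0.0],
}

vector = complex_vector(embeddings, ["P1", "P2", "P3"])
```

Mean pooling yields a fixed-length feature vector regardless of how many proteins a complex contains, which is what a downstream classifier requires.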

Random forest model
The random forest model [20] was used to classify the candidate protein complexes obtained by SVCC. Random forest uses multiple classification trees to distinguish and organize data, and it is a kind of ensemble classification model. While classifying the data, it can also score the importance of each variable and evaluate the role of each variable in the classification. The random forest model builds a forest in a randomized way. The forest comprises many decision trees, which are built independently of one another. When a new sample arrives, each decision tree in the random forest makes its own judgment. For classification problems, voting is usually used, and the category with the most votes is taken as the final model output. Compared with other classification methods, the random forest can handle high-dimensional data without feature selection. It performs well on large sample data and can also report the importance of variables. In addition, the introduction of randomness gives random forests an excellent anti-noise ability.

Candidate protein complex classification
We first obtain the vector representation of each protein in the protein interaction network through the network representation learning method Node2vec. We then calculate the average of the protein vector representations in a protein complex as the vector of that protein complex. We calculate the vector representations of the protein complexes in the positive and negative examples as their feature vectors, and combine the feature vectors of the positive and negative samples to obtain the training set. After constructing the training set, we use it as input data to train the RF model. In the same way, we use Node2vec to calculate the feature vectors of the candidate protein complexes identified by SVCC as the test set. We use the trained RF model to classify the feature vectors of the test set. The candidate protein complexes marked as positive examples are the final predicted protein complexes.

Datasets
In this study, we use the human protein interaction network and the yeast protein interaction network as experimental data. The human protein interaction network is downloaded from the Human Protein Reference Database (HPRD) [22]. The yeast protein interaction network comes from the extensive yeast data set DIP [21]. For these two PPI networks, we removed the repeated and self-connected protein relationships. The basic information of the two resulting protein interaction networks is shown in Table 2. The standard human protein complex data set we use is also downloaded from HPRD and includes 1514 human protein complexes. The standard yeast protein complex data set comprises four common yeast standard protein complex data sets. The positive examples are the standard human protein complex data set and the standard yeast protein complex data set described above. Because the standard yeast set is assembled from four sources, it contains protein molecules that do not exist in the DIP network, whereas the protein complexes predicted from a PPI network can only contain proteins present in that network. Therefore, when experimenting on the DIP network, it is necessary to filter out the protein molecules in the positive protein complexes that do not belong to the DIP network. Our negative examples are generated by randomly selecting nodes from the PPI network, and their size is consistent with the positive samples. In particular, every protein complex in both the positive and negative examples contains at least 3 protein molecules.
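The negative example generation can be sketched as follows. The text only states that negatives are drawn at random from the PPI network with a size consistent with the positive samples; this sketch assumes one negative per positive complex with the same number of proteins. The function name `sample_negatives`, the fixed seed, and the toy data are hypothetical.

```python
import random

def sample_negatives(network_nodes, positives, seed=0):
    """Draw one random node set per positive complex, matching its size."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    pool = sorted(network_nodes)   # a stable order for sampling
    # Assumption: "consistent size" means the same count and the same sizes.
    return [set(rng.sample(pool, len(complex_))) for complex_ in positives]

# Hypothetical network of 20 proteins and two positive complexes (size >= 3).
network_nodes = {f"P{i}" for i in range(20)}
positives = [{"P0", "P1", "P2"}, {"P3", "P4", "P5", "P6"}]

negatives = sample_negatives(network_nodes, positives)
```

A stricter variant might also reject any random set that coincides with a known complex; the text does not say whether this check is performed.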

Evaluation metrics
We used four performance evaluation indexes to evaluate the predicted protein complexes: precision, recall, F-score, and P-value.
Suppose that B = {b 1 , b 2 , . . . , b m } and P = {p 1 , p 2 , . . . , p n } represent the standard protein complex set and the predicted protein complex set, respectively. For a real protein complex b ∈ B and a predicted protein complex p ∈ P , we can calculate their similarity, namely the neighborhood affinity score NA, as Eq. (15).
where V b and V p represent the sets of protein molecules in complexes b and p , respectively, and |V b ∩ V p | represents the number of proteins shared by the two protein complexes.
Generally speaking, if NA(b, p) > 0.25, the two protein complexes are considered to be matched. Let N cb denote the number of standard protein complexes that match at least one predicted protein complex, and let N cp denote the number of predicted protein complexes that match at least one standard protein complex. The definitions of precision and recall are then shown as Eqs. (16) and (17).
F-score is defined as the harmonic mean of precision and recall , that is, a balanced combination of the two, and its definition is shown as Eq. (18).
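Equations (15) to (18) are not reproduced in this text, so the following Python sketch assumes their standard forms: NA(b, p) = |V b ∩ V p |² / (|V b | · |V p |), precision = N cp / |P|, recall = N cb / |B|, and F-score = 2 · precision · recall / (precision + recall). The function names and the toy complexes are illustrative assumptions.

```python
def neighborhood_affinity(b, p):
    """Assumed form of Eq. (15): |Vb & Vp|^2 / (|Vb| * |Vp|)."""
    shared = len(b & p)
    return shared * shared / (len(b) * len(p))

def evaluate(benchmark, predicted, threshold=0.25):
    """Return (precision, recall, f_score) under NA(b, p) > threshold matching."""
    def matched(x, others):
        return any(neighborhood_affinity(x, o) > threshold for o in others)

    n_cb = sum(1 for b in benchmark if matched(b, predicted))  # matched standards
    n_cp = sum(1 for p in predicted if matched(p, benchmark))  # matched predictions
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cb / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical standard and predicted complexes.
benchmark = [{"A", "B", "C", "D"}, {"E", "F", "G"}]
predicted = [{"A", "B", "C"}, {"X", "Y", "Z"}, {"E", "F", "H"}]

precision, recall, f_score = evaluate(benchmark, predicted)
```

Here two of the three predictions match a standard complex and both standard complexes are matched, so precision is 2/3 and recall is 1.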
In addition, we use the biological process annotations in the gene ontology to analyze the biological significance of the protein complexes identified by different methods. The statistical significance of a protein complex can be assessed through its biological function. It is calculated with the hypergeometric distribution, as defined in Eq. (19).
where |V | represents the number of protein nodes of the corresponding entire species. C represents the predicted protein complex, which contains k proteins annotated by the gene ontology functional group F . The smaller the P-value of a protein complex is, the more likely its proteins are to be annotated with the same function, and the more likely it is to be a true complex.
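Equation (19) is not reproduced in this text. A minimal sketch of a hypergeometric P-value, assuming the form commonly used for functional enrichment (the probability that a random protein set of size |C| contains at least k proteins annotated with F), is given below. The function name and argument names are hypothetical, and the exact form used in the paper may differ.

```python
from math import comb

def functional_p_value(total, annotated, complex_size, k):
    """Upper-tail hypergeometric probability P(X >= k).

    total        -- |V|, number of proteins in the whole network
    annotated    -- |F|, proteins annotated with the functional group
    complex_size -- |C|, proteins in the predicted complex
    k            -- proteins in the complex annotated with F
    """
    upper = min(annotated, complex_size)
    favorable = sum(
        comb(annotated, i) * comb(total - annotated, complex_size - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(total, complex_size)

# Toy numbers: 10 proteins, 4 annotated, a complex of 3 that are all annotated.
p_value = functional_p_value(total=10, annotated=4, complex_size=3, k=3)
```

In the toy case only 4 of the 120 possible three-protein sets are fully annotated, so the P-value is 1/30.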
Our proposed method is based on supervised learning. To evaluate our method using an independent testing set, we follow the previous work to evaluate the proposed method in a five-fold cross-validation experimental setting.

Results and discussion
This section introduces our comparative experiment in detail, mainly composed of three parts. In the first part, we compare the performance of our method with several existing protein complex prediction methods. The second part analyzes the impact of different factors on the experimental performance, including classification models, network representation learning methods, and feature sets. In the third part, the biological significance of the predicted complexes is evaluated and discussed.

Comparison results with other methods
To validate the effectiveness of our method in predicting protein complexes, we compared our method with MCODE [2], COACH [6], CMC [3], ClusterONE [8], GANE [15], EWCA [14], SLPC [16], and SVCC only on two protein interaction networks of DIP and HPRD. To compare these methods as fair as possible, we use a five-fold-crossvalidation experimental setting to identify protein complexes. We divided the standard set of protein complexes for DIP and HPRD into five parts as {C 1 , C 2 , C 3 , C 4 , C 5 } . In each crossover experiment, we use 4 of them as the training set and train the SVC model and the random forest model to recognize the complexes in the network. Since the identified complexes may contain the complexes of the training set, we remove the complexes that overlap with the training set to obtain R 1 , where the overlap threshold is set to 0.9, calculated by the formula (13). After five rounds of such experiments, we combined the set of five complexes identified as {R 1 , R 2 , R 3 , R 4 , R 5 } . And remove the complexes in which the overlap ratio is greater than 0.6. The remaining protein complexes are then taken as the final result and evaluated using a standard collection of protein complexes. The MCODE and ClusterONE methods are processed (18) by Cytoscape [33]. The parameters of the other methods are set according to their authors' recommendations. Our method used node2vec to learn the vector representation of proteins on the protein interaction network. The parameters of node2vec set to q = 1, p = 8, dimensions = 64. The comparison results between our method and other methods are shown in Table 3. Table 3 shows the results of our method compared with other methods on the yeast PPI network DIP and the human PPI network HPRD. When we use the yeast PPI network DIP as the experimental network, our method achieves the highest F-score of 0.5539, which is much higher than the unsupervised learning methods. 
At the same time, the F-score obtained by using only the SVCC method is 0.5231, which is slightly lower than the F-score of 0.5249 of the supervised learning method SLPC. After using the RF model to classify the candidate protein complexes identified by SVCC, the F-score improves by about 3% (from 0.5231 to 0.5539). On the human PPI network HPRD, our method achieves the highest F-score of 0.6268. Compared with the unsupervised learning methods other than EWCA, our method improves by at least 15%; compared with EWCA, it improves by nearly 10%. Compared with the supervised learning method SLPC, our method improves by about 8%. When we use only the SVCC method, the obtained F-score is 0.5213. After using the RF model to classify the candidate protein complexes identified by SVCC, the F-score improves by about 10% (from 0.5213 to 0.6268). These experiments show that using a trained RF model to classify the candidate protein complexes predicted by SVCC significantly improves performance. In summary, our method achieves good performance on both the yeast PPI network and the human PPI network, and on the human PPI network in particular it is significantly better than the other methods. Our method therefore outperforms the existing protein complex prediction methods compared here. We also note that precision is much higher than recall on the DIP network, whereas the opposite holds on the HPRD network. This may be because HPRD is much larger than DIP (see Table 2), which allows some methods to identify a large number of complexes on HPRD. We have added the number of identified complexes to Table 3. From Table 3, we can see that six methods (COACH, CMC, EWCA, SLPC, SVCC, and our method) identify more than 1000 protein complexes on the HPRD network, which is much higher than the number of standard protein complexes, namely |P| ≫ |B| in Eqs. 16 and 17. As a result, precision is lower than recall on the HPRD network.
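
The filtering and evaluation steps described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from this section: the overlap score is assumed to have the common form |a ∩ b|² / (|a| · |b|) as a stand-in for formula (13), precision and recall are assumed to be match counts divided by |P| and |B|, the matching threshold `match=0.2` is a hypothetical default, and the greedy keep-first rule in `drop_redundant` is only one possible reading of the 0.6 filter. All function names are illustrative.

```python
def overlap(a, b):
    """Overlap score |a & b|^2 / (|a| * |b|); an assumed stand-in for formula (13)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def drop_training_matches(candidates, training, threshold=0.9):
    """Remove candidates whose overlap with any training complex reaches the threshold."""
    return [c for c in candidates
            if all(overlap(c, t) < threshold for t in training)]


def drop_redundant(complexes, threshold=0.6):
    """Greedy pass (assumption): keep a complex unless it overlaps a kept one by more than the threshold."""
    kept = []
    for c in complexes:
        if all(overlap(c, k) <= threshold for k in kept):
            kept.append(c)
    return kept


def precision_recall_f(predicted, benchmark, match=0.2):
    """Match-based precision, recall and F-score (assumed form of the evaluation equations)."""
    n_cp = sum(any(overlap(p, b) >= match for b in benchmark) for p in predicted)
    n_cb = sum(any(overlap(b, p) >= match for p in predicted) for b in benchmark)
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cb / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# One fold: the candidate identical to a training complex is removed.
training = [{"A", "B", "C"}]
fold_result = drop_training_matches([{"A", "B", "C"}, {"D", "E", "G"}], training)

# Merged folds: the second complex overlaps the first by 0.75 > 0.6 and is dropped.
merged = drop_redundant([{"D", "E", "G"}, {"D", "E", "G", "H"}, {"X", "Y"}])

# Toy case with many more predictions than benchmark complexes (|P| >> |B|).
benchmark = [{"A", "B", "C"}, {"D", "E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "G"}, {"H", "I"},
             {"J", "K"}, {"L", "M"}, {"N", "O"}]
precision, recall, f_score = precision_recall_f(predicted, benchmark)
print(precision, recall, f_score)
```

In the toy case both benchmark complexes are matched, so recall is 1, while only two of the six predictions match a benchmark complex, so precision is 1/3. This mirrors the situation on HPRD, where a large |P| pulls precision below recall.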

Comparison with other classification models
We use the RF model to classify the candidate protein complexes obtained by the SVCC method. The above experiments show that the RF model can significantly improve the experimental performance. To further verify the effectiveness of the random forest model RF, we also chose to train other supervised learning models to classify candidate protein complexes.
We trained six other supervised learning models, namely naive Bayes (Bayes), logistic regression (LR), KNN, XGBoost, AdaBoost, and gradient boosted decision trees (GBDT), and compared their experimental results with those of the random forest model RF. We conducted parameter tuning experiments on these supervised learning models and selected the best-performing parameters, which are shown in Table 4. The results of comparing the RF model with the other six supervised learning models on the yeast PPI network DIP and the human PPI network HPRD are shown in Figs. 3 and 4.
Fig. 3 shows that on the yeast PPI network DIP, the XGBoost model achieves the highest F-score of 0.5644. Bayes and RF achieve the second and third highest F-scores, 0.5555 and 0.5539, respectively, so the XGBoost model is about 1% higher than the RF model. Fig. 4 shows that on the human PPI network HPRD, the RF model achieves the highest F-score of 0.6268. The XGBoost model also achieves a high F-score of 0.6035, but the Bayes model achieves an F-score of only 0.5036. In summary, the RF model achieves the best performance on the human PPI network HPRD and good performance on the yeast PPI network DIP. Therefore, we finally choose the RF model to classify the candidate protein complexes obtained by the SVCC method.
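
As a small worked check of this comparison, the sketch below aggregates the F-scores quoted in the text for the three models whose values are stated. The aggregation by mean and by worst case across the two networks is an illustrative assumption, not a selection rule stated in the paper, and the helper name `summarize` is hypothetical.

```python
# F-scores quoted in the text for Figs. 3 and 4. The other four models are
# omitted because their values are not given in the text.
f_scores = {
    "RF": {"DIP": 0.5539, "HPRD": 0.6268},
    "XGBoost": {"DIP": 0.5644, "HPRD": 0.6035},
    "Bayes": {"DIP": 0.5555, "HPRD": 0.5036},
}


def summarize(scores):
    """Return (model, mean F-score, worst F-score) rows, sorted by mean, best first."""
    rows = []
    for model, per_network in scores.items():
        values = list(per_network.values())
        rows.append((model, sum(values) / len(values), min(values)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarize(f_scores)
for model, mean_f, worst_f in summary:
    print(f"{model:8s} mean={mean_f:.4f} worst={worst_f:.4f}")
```

Under this reading, RF has the highest mean F-score across the two networks, while XGBoost has the slightly better worst case because of its result on DIP. The two models are therefore close, and the choice of RF rests mainly on its advantage on HPRD.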

Influence of different network representation learning methods
In this study, we applied the network representation learning method node2vec to obtain the vector representation of each protein. Then, from the protein vector representations, the protein complex vector is calculated as the feature of the protein complex. Finally, a model trained on these feature vectors classifies the candidate protein complexes. To verify the effect of node2vec on our method, we also evaluate four other network representation learning methods: DeepWalk [13], HOPE [34], LINE [35], and SDNE [36]. Most of the parameters of these five network representation learning methods are set to their default values, and only a few parameters with significant influence (such as the dimension) are tuned. The specific parameter settings of the five network representation learning methods are shown in Table 5. The comparison between node2vec and the other four network representation learning methods on the yeast PPI network DIP and the human PPI network HPRD is shown in Figs. 5 and 6. As shown in Figs. 5 and 6, our method achieves a higher F-score with node2vec than with any of the other network representation learning methods on both the yeast PPI network DIP and the human PPI network HPRD. We also note that DeepWalk outperforms node2vec in precision, especially on the HPRD network. DeepWalk selects nodes uniformly at random during random walks [13]. 
In contrast, node2vec uses two parameters, p and q, to control the direction of sampling during random walks [28]. In other words, node2vec uses a flexible, biased random walk sampling strategy to trade off the local and global structure of the network. Compared with the DIP network, the HPRD network is much larger. The results in Fig. 6 suggest that DeepWalk can achieve higher precision on larger networks such as HPRD, but node2vec achieves higher recall and F-score than DeepWalk. Overall, node2vec achieves the best performance among the five methods on both the DIP and HPRD networks.

Table 5 Parameter settings of five network representation learning methods
Figure caption: The impact of different network representation learning methods on the experimental performance of the HPRD network
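
The difference between the two sampling strategies can be made concrete with a minimal sketch of the node2vec second-order walk. The toy graph, the function names, and the dictionary-of-sets graph encoding are assumptions for illustration; the sketch uses the standard node2vec weighting (1/p to return to the previous node, 1 for a common neighbor of the previous node, 1/q otherwise) and is not the implementation used in the experiments.

```python
import random


def transition_weights(graph, prev, cur, p, q):
    """Unnormalized node2vec weights for the step out of `cur`, having arrived from `prev`."""
    weights = {}
    for nxt in graph[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # step back to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stay within the neighborhood of prev
        else:
            weights[nxt] = 1.0 / q   # move further away from prev
    return weights


def biased_walk(graph, start, length, p, q, rng):
    """Generate one walk of up to `length` nodes using the biased transition weights."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        w = transition_weights(graph, walk[-2], cur, p, q)
        walk.append(rng.choices(neighbors, weights=[w[n] for n in neighbors], k=1)[0])
    return walk


# Hypothetical toy PPI graph: a triangle A-B-C with a pendant node D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

weights_biased = transition_weights(graph, "A", "C", p=8, q=1)
weights_uniform = transition_weights(graph, "A", "C", p=1, q=1)
walk = biased_walk(graph, "A", 10, p=8, q=1, rng=random.Random(0))
print(weights_biased, weights_uniform, walk)
```

With p = 8 and q = 1, the setting reported above, the walk is discouraged from stepping straight back to the node it came from. With p = q = 1 all weights are equal and the walk reduces to the uniform sampling attributed to DeepWalk.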

The impact between binary and multiple classification models
The training set data we selected for training the SVC model and the RF model is the same and includes positive and negative examples. The positive data is the standard protein complex data set. The negative data is generated by randomly selecting nodes from the PPI network according to the ratio of the standard protein complexes. In existing research on supervised protein complex prediction, many researchers have used multi-class labels to train supervised learning models. To verify the effectiveness of the selected two-class training set data, we also used three-class training set data to train the supervised learning models in experiments on the DIP and HPRD protein interaction networks. In the three-class data, the additional class consists of complexes that the COACH method identifies from the PPI network. This is mainly because the possibility of the complexes identified by COACH being real complexes is higher than that of negative sample data but lower than that of positive sample data, which can effectively increase the richness of the training sample data. To ensure the accuracy of the experiment, the protein complexes matching the positive examples were filtered out. The results of the experimental comparison of the two-class and three-class data on the yeast PPI network DIP and the human PPI network HPRD are shown in Table 6. As shown in Table 6, on both networks the SVC model performs much better when trained with the two-class training set data than with the three-class training set data. Therefore, we select the two-class training set data to train the SVC model. When training the RF model on the yeast PPI network DIP, the F-score obtained with the two-class training set data is 0.5539 and the F-score obtained with the three-class training set data is 0.5589, a difference of less than 1%. 
When training the RF model on the human PPI network HPRD, the F-score obtained with the two-class training set data is 0.6268 and the F-score obtained with the three-class training set data is 0.6096, so the two-class data performs about 2% better than the three-class data. Therefore, we also select the two-class training set data for the RF model. In summary, to obtain good experimental performance, we choose the two-class training set data when training the SVC and RF models.
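
The construction of the negative examples can be sketched as follows. This is only one possible reading of "according to the ratio of the standard protein complexes": the sketch assumes that negative sets are random node sets whose sizes are drawn from the size distribution of the positive complexes. The function name, the attempt cap, and the toy protein identifiers are hypothetical.

```python
import random


def sample_negatives(nodes, positives, n_negatives, rng):
    """Draw random node sets whose sizes follow the sizes of the positive complexes (assumption)."""
    nodes = sorted(nodes)
    positive_sets = {frozenset(c) for c in positives}
    sizes = [len(c) for c in positives]
    negatives, seen = [], set()
    attempts = 0
    # The attempt cap guards against an endless loop on very small networks.
    while len(negatives) < n_negatives and attempts < 100 * n_negatives:
        attempts += 1
        size = rng.choice(sizes)                      # size from the empirical distribution
        candidate = frozenset(rng.sample(nodes, size))
        if candidate in positive_sets or candidate in seen:
            continue                                  # skip known complexes and duplicates
        seen.add(candidate)
        negatives.append(set(candidate))
    return negatives


# Hypothetical toy network with 12 proteins and three standard complexes.
nodes = [f"P{i}" for i in range(1, 13)]
positives = [{"P1", "P2", "P3"}, {"P4", "P5", "P6", "P7"}, {"P8", "P9"}]
negatives = sample_negatives(nodes, positives, n_negatives=3, rng=random.Random(0))
print(negatives)
```

A three-class variant would add a third label for intermediate examples, such as the COACH-derived complexes mentioned above, alongside these positive and negative sets.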

The biological significance of predicted protein complexes
In this section, we validate the biological significance of the predicted protein complexes based on Gene Ontology (GO). In previous complex identification methods, many researchers used the P-value to evaluate the biological significance of protein complexes. The P-value indicates the probability that the proteins in a complex share a common function purely by chance. If an identified protein complex has a low P-value, the co-occurrence of the proteins in the complex is unlikely to be accidental. The lower the P-value, the higher the biological significance of the complex, and the more likely it is a significant complex. This paper uses GO term enrichment analysis to determine whether the members of a predicted complex are likely to have a common function. We used LAGO [37] to calculate P-values for protein complexes for functional enrichment analysis and set all parameters in LAGO to their default values. LAGO is a fast tool that improves on GO Term Finder [37]; it finds significant GO terms in a list of gene names and calculates the P-value using the hypergeometric distribution. Some complexes may have low P-values for several different GO terms. In this paper, we chose the best GO term (the one with the lowest P-value) for each complex. Tables 7 and 8 present ten protein complexes with low P-values that we identified on the DIP and HPRD PPI networks. Moreover, these protein complexes have a high degree of matching with standard protein complexes (calculated by formula (13)), suggesting that predicted complexes with low P-values are likely to be genuine protein complexes.
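
The hypergeometric P-value behind this kind of enrichment analysis can be illustrated with a minimal sketch. It computes the standard upper-tail probability of seeing at least the observed number of annotated proteins in a complex and then picks the term with the lowest P-value. It does not attempt to reproduce LAGO itself; the function names, term labels, and counts below are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, complex_size, hits):
    """P(X >= hits) when drawing `complex_size` proteins from `population`,
    of which `annotated` carry the GO term (hypergeometric upper tail)."""
    upper = min(annotated, complex_size)
    total = comb(population, complex_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, complex_size - i)
               for i in range(hits, upper + 1))
    return tail / total


def best_term(term_counts, population, complex_size):
    """Return the term with the lowest P-value.
    `term_counts` maps a term to (annotated in population, hits in complex)."""
    scored = {term: enrichment_p_value(population, n_annotated, complex_size, n_hits)
              for term, (n_annotated, n_hits) in term_counts.items()}
    term = min(scored, key=scored.get)
    return term, scored[term]


# Hypothetical complex of 6 proteins in a proteome of 6000 proteins.
term_counts = {
    "term_a": (50, 5),     # rare term, 5 of 6 members annotated
    "term_b": (1000, 5),   # common term, 5 of 6 members annotated
}
term, p_value = best_term(term_counts, population=6000, complex_size=6)
print(term, p_value)
```

With the same number of annotated members, the rarer term yields the lower P-value, which is why the best term for a complex is usually a specific one rather than a broad one.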